Computing text semantic relatedness using the contents and links of a hypertext encyclopedia
نویسندگان
چکیده
We propose a method for computing semantic relatedness between words or texts by using knowledge from hypertext encyclopedias such as Wikipedia. A network of concepts is built by filtering the encyclopedia’s articles, each concept corresponding to an article. Two types of weighted links between concepts are considered: one based on hyperlinks between the texts of the articles, and another one based on the lexical similarity between them. We propose and implement an efficient random walk algorithm that computes the distance between nodes, and then between sets of nodes, using the visiting probability from one (set of) node(s) to another. Moreover, to make the algorithm tractable, we propose and validate empirically two truncation methods, and then use an embedding space to learn an approximation of visiting probability. To evaluate the proposed distance, we apply our method to four important tasks in natural language processing: word similarity, document similarity, document clustering and classification, and ranking in information retrieval. The performance of the method is state-of-the-art or close to it for each task, thus demonstrating the generality of the knowledge resource. Moreover, using both hyperlinks and lexical similarity links improves the scores with respect to a method using only one of them, because hyperlinks bring additional real-world knowledge not captured by lexical similarity.
منابع مشابه
Computing Text Semantic Relatedness Using the Contents and Links of a Hypertext Encyclopedia: Extended Abstract
We propose methods for computing semantic relatedness between words or texts by using knowledge from hypertext encyclopedias such as Wikipedia. A network of concepts is built by filtering the encyclopedia’s articles, each concept corresponding to an article. A random walk model based on the notion of Visiting Probability (VP) is employed to compute the distance between nodes, and then between s...
متن کاملAutomatically generating hypertext in newspaper articles by computing semantic relatedness
We discuss an automatic method for the construction of hypertext links within and between newspaper articles. The method comprises three steps: determining the lexical chains in a text, building links between the paragraphs of articles, and building links between articles. Lexical chains capture the semantic relations between words that occur throughout a text. Each chain is a set of related wo...
متن کاملUsing natural language processing to construct large - scale hypertext systems
theory of how texts are connected (a theory of text association) and partial theories of what the text is describing (domain theories). However, systems built in this way, such as ASK Systems [Ferguson, et al . 1992], are difficult to build even when the domain theories and a theory of text association are in place . Natural language understanding technologies that take advantage of underlying ...
متن کاملBuilding hypertext links in newspaper articles using semantic similarity
We discuss an automatic method for the construction of hypertext links within and between newspaper articles. The method comprises three steps: determining the lexical chains in a text, building links between the paragraphs of articles, and building links between articles. Lexical chains capture the semantic relations between words that occur throughout a text. Each chain is a set of related wo...
متن کاملMeasuring of Semantic Relatedness between Words based on Wikipedia Links
A novel technique of semantic relatedness measurement between words based on link structure of Wikipedia was provided. Only Wikipedia’s link information was used in this method, which avoid researchers from burdensome text processing. During the process of relatedness computation, the positive effects of two-directional Wikipedia’s links and four link types are taken into account. Using a widel...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Artif. Intell.
دوره 194 شماره
صفحات -
تاریخ انتشار 2013